New Technologies for Developmental Evolutionary Biology Studies
Let's load all the libraries we will use. A handy tool for loading packages is pacman. This package conveniently loads libraries, installs libraries that are not installed, including bioconductor packages!
# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")
# use pacman to load libraries
pacman::p_load(tidyverse,
DESeq2,
ggrepel,
plotly)
We have seen that Spadefoot toad tadpoles have a developmentally fixed dorso-ventral gradient in pigmentation, and that tadpoles can increase or decrease their pigmentation based on the background they are raised on. We now want to explore what are the underlying genetic mechanism that are driving these changes in phenotype. Here, we will jump in around about the middle of a typical RNAseq workflow. We will have already done the following:
Our starting point will therefore be the count data for our RNAseq experiment prepared using tximport. In this exercise we will do the following:
Lets load the count data:
txi<-readRDS("../data/salmon_gene_counts.rds")
str(txi)
## List of 4
## $ abundance : num [1:32531, 1:12] 3.17 0 0 0 13.65 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
## .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
## $ counts : num [1:32531, 1:12] 299 0 0 0 879 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
## .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
## $ length : num [1:32531, 1:12] 6209.3 71.4 547.6 1019.4 4242.4 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
## .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
## $ countsFromAbundance: chr "no"
For those of you unfamiliar, .rds files is a single R
data object, created with saveRDS(). It is particularly
useful for saving R objects that are not tabular, or more than
2-dimensional (e.g. lists, functions etc.).
Question: What does is the class and structure of this data object?
class(txi)
## [1] "list"
str(txi)
## List of 4
## $ abundance : num [1:32531, 1:12] 3.17 0 0 0 13.65 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
## .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
## $ counts : num [1:32531, 1:12] 299 0 0 0 879 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
## .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
## $ length : num [1:32531, 1:12] 6209.3 71.4 547.6 1019.4 4242.4 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
## .. ..$ : chr [1:12] "7D" "7V" "8D" "8V" ...
## $ countsFromAbundance: chr "no"
lapply(X=txi, FUN=head)
## $abundance
## 7D 7V 8D 8V 10D 10V
## PECUL23A000004 3.172516 2.49415 1.846965 3.275358 9.258514 10.84399
## PECUL23A000005 0.000000 0.00000 0.000000 0.000000 0.000000 0.00000
## PECUL23A000006 0.000000 0.00000 0.000000 0.000000 0.000000 0.00000
## PECUL23A000008 0.000000 0.00000 0.000000 0.000000 0.000000 0.00000
## PECUL23A000011 13.646505 12.58050 11.662266 12.011888 20.974082 21.07977
## PECUL23A000012 0.000000 0.00000 0.000000 0.000000 0.000000 0.00000
## 11D 11V 21D 21V 22D 22V
## PECUL23A000004 2.008093 2.847134 10.00756 6.463992 7.079812 11.03426
## PECUL23A000005 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000
## PECUL23A000006 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000
## PECUL23A000008 0.000000 0.000000 0.00000 0.076718 0.000000 0.00000
## PECUL23A000011 13.069257 12.598137 17.31247 15.845097 18.138504 17.05364
## PECUL23A000012 0.000000 0.000000 0.00000 0.000000 0.000000 0.00000
##
## $counts
## 7D 7V 8D 8V 10D 10V 11D
## PECUL23A000004 299.200 222.206 172.249 303.135 798.556 950.925 209.388
## PECUL23A000005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000006 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000008 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000011 879.316 728.819 724.642 713.794 1238.869 1267.971 898.400
## PECUL23A000012 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 11V 21D 21V 22D 22V
## PECUL23A000004 313.572 842.375 503.027 675.112 988.542
## PECUL23A000005 0.000 0.000 0.000 0.000 0.000
## PECUL23A000006 0.000 0.000 0.000 0.000 0.000
## PECUL23A000008 0.000 0.000 1.000 0.000 0.000
## PECUL23A000011 933.099 1005.042 858.488 1125.317 1044.139
## PECUL23A000012 0.000 0.000 0.000 0.000 0.000
##
## $length
## 7D 7V 8D 8V 10D
## PECUL23A000004 6209.34168 6537.26970 6379.89021 6485.19773 6201.16983
## PECUL23A000005 71.39554 71.39554 71.39554 71.39554 71.39554
## PECUL23A000006 547.63317 547.63317 547.63317 547.63317 547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4242.39630 4250.94094 4250.63982 4163.96652 4246.70282
## PECUL23A000012 447.71650 447.71650 447.71650 447.71650 447.71650
## 10V 11D 11V 21D 21V
## PECUL23A000004 6199.02702 6438.33500 6315.82807 6149.42485 6100.23764
## PECUL23A000005 71.39554 71.39554 71.39554 71.39554 71.39554
## PECUL23A000006 547.63317 547.63317 547.63317 547.63317 547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1021.78100
## PECUL23A000011 4252.16543 4244.47366 4247.39914 4241.13665 4247.13092
## PECUL23A000012 447.71650 447.71650 447.71650 447.71650 447.71650
## 22D 22V
## PECUL23A000004 6360.54714 6205.29626
## PECUL23A000005 71.39554 71.39554
## PECUL23A000006 547.63317 547.63317
## PECUL23A000008 1019.39872 1019.39872
## PECUL23A000011 4138.22686 4240.83801
## PECUL23A000012 447.71650 447.71650
##
## $countsFromAbundance
## [1] "no"
abundamce, counts, length) and 1
character vector (countsFromAbundance).Question: What is the content of this object and how is it arranged?
* All three arrays have target transcripts as rows and biological samples per column. * Counts: estimate of the number of reads mapping to each transcript. * Abundance: Raw counts cannot be compared across samples because each library may vary slightly in terms of the total number of reads, differences in sequencing bias and difference in transcript lengths. Salmon (the program we used to quantify reads) also produces an array of "Abundances" which are normalized counts. According to the salmon documention, this is Transcripts Per Million (TPM). This basically means that per sample, the total counts add up to 1 million. We could check this:
apply(X=txi$abundance, FUN=sum, MARGIN = 2)
## 7D 7V 8D 8V 10D 10V 11D 11V
## 933611.3 925431.3 931110.8 939398.7 932355.7 938520.5 945823.7 940997.4
## 21D 21V 22D 22V
## 930301.0 934942.5 930828.3 931353.4
In this case, they don't quite add up to 1 million because I have already filtered this matrix, to remove non-coding and mitochondrial DNA.
tximport. Default is
no (counts are from salmon).I would argue that it is generally good practice to always plot the distributions of data. This is a good way to get to know your data and also to spot any potentially considerations we have give any analyses we may perform. Lets go ahead and plot the distributions of counts for each sample.
txi$counts %>%
as_tibble() %>% # turns matrix into tibble object class
pivot_longer(everything(), names_to="sample", values_to="counts") %>% # turns wide table to long table
mutate(counts=counts+1) %>% # we are going to plot on a log scale, so lets add 1 to each count so that log10(1) = 0
# filter(counts>1) %>%
ggplot(aes(counts)) +
geom_histogram() +
ylab("number of genes") +
scale_x_log10() + # plot x-axis on a log scale
facet_wrap(~sample) # facet each sample into its own plot
Question: What do these distributions tell us about our count data?
We will see later on that software for performing differential gene expression is quite good at dealing with non-normal, high zero-count data, but we can make our lives a little easier by removing at least those genes/loci that don't have any mapped reads.
# find rownames where all counts == 0
keepers<-rownames(txi$counts)[rowSums(txi$counts)>0]
length(keepers)
## [1] 21412
# keep all those rows
txi$counts<-txi$counts[keepers,]
txi$abundance<-txi$abundance[keepers,]
txi$length<-txi$length[keepers,]
#confirm rows have been dropped
dim(txi$counts)
## [1] 21412 12
To make any sort of inference, we need to add some additional information to the column names that tell us about the sample types and experimental conditions.
Let's load our design matrix. This is a .csv file where we have indicate which sample came from what treatment.
samples<-read.csv("../data/experiment_data.csv")
samples
## specimen_id background
## 1 7 white
## 2 8 white
## 3 9 white
## 4 10 black
## 5 11 black
## 6 12 black
## 7 21 white
## 8 22 black
This tells us which specimens were subjected to which background colour. We can use some of our Tidyverse knowledge from the previous tutorial to now match this with the column names from the gene expression data.
rna_samples<-tibble(
sample_id=colnames(txi$counts),
specimen_id=str_extract(sample_id, "\\d+") %>% as.numeric(),
position=str_extract(sample_id, "[A-Z]")
) %>%
left_join(samples)
rna_samples
## # A tibble: 12 × 4
## sample_id specimen_id position background
## <chr> <dbl> <chr> <chr>
## 1 7D 7 D white
## 2 7V 7 V white
## 3 8D 8 D white
## 4 8V 8 V white
## 5 10D 10 D black
## 6 10V 10 V black
## 7 11D 11 D black
## 8 11V 11 V black
## 9 21D 21 D white
## 10 21V 21 V white
## 11 22D 22 D black
## 12 22V 22 V black
Question: How many repliactes do we have per tissue type per background?
rna_samples %>%
group_by(position, background) %>%
tally()
## # A tibble: 4 × 3
## # Groups: position [2]
## position background n
## <chr> <chr> <int>
## 1 D black 3
## 2 D white 3
## 3 V black 3
## 4 V white 3
We can now combine the design matrix and the expression data into a
format that DESeq2 likes. tximport plays very well with
DESeq2, so this is easy!
# double check column names are the same order
colnames(txi$counts)==rna_samples$sample_id
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# make DESeq object
dds <- DESeqDataSetFromTximport(txi,
colData = rna_samples,
design = ~ position + background)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
The dataset should have successfully been stored as a
DESeqDataSet object. You should see that it has imported
the counts and the length data from the
txi object and the sample information from the design
matrix. lets take a look at this object:
dds
## class: DESeqDataSet
## dim: 21412 12
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(21412): PECUL23A000004 PECUL23A000008 ... PECUL23A062646
## PECUL23A062647
## rowData names(0):
## colnames(12): 7D 7V ... 22D 22V
## colData names(4): sample_id specimen_id position background
First up, we may be curious as to whether there are some basic patterns in our data. For instance, we would expect biological replicates to cluster together based on their overall expression values. As it stands, we are dealing with a highly dimensional dataset though (each sample has >30k 'variables') and so a dimensional reduction technique like a Principal Component analysis might be a good idea.
Before we go any further however, we need to do a better job at normalizing our count data, than just making counts relative to 1 million per sample. This includes:
This second normalization may require a bit of explanation. You
should recall that the dds object contains not only the
counts, but also the transcript lengths.
assay(dds, "avgTxLength") %>% head()
## 7D 7V 8D 8V 10D 10V 11D
## PECUL23A000004 6209.342 6537.270 6379.890 6485.198 6201.170 6199.027 6438.335
## PECUL23A000008 1019.399 1019.399 1019.399 1019.399 1019.399 1019.399 1019.399
## PECUL23A000011 4242.396 4250.941 4250.640 4163.967 4246.703 4252.165 4244.474
## PECUL23A000017 1819.291 1752.458 1718.785 1703.185 1910.097 1954.068 2145.945
## PECUL23A000021 2250.962 2245.367 2256.992 2281.907 2307.002 2320.662 2249.317
## PECUL23A000024 2498.989 2388.780 2391.957 2395.681 2636.826 2553.543 2884.335
## 11V 21D 21V 22D 22V
## PECUL23A000004 6315.828 6149.425 6100.238 6360.547 6205.296
## PECUL23A000008 1019.399 1019.399 1021.781 1019.399 1019.399
## PECUL23A000011 4247.399 4241.137 4247.131 4138.227 4240.838
## PECUL23A000017 1796.677 1865.823 1916.210 1887.871 2010.417
## PECUL23A000021 2276.333 2261.980 2263.461 2274.141 2318.005
## PECUL23A000024 2438.032 2876.946 2883.781 2878.930 2425.739
This shows that the different targets (i.e. the loci that the RNA-seq reads were mapped to) are of very different lengths. Logically, a larger target is expected to also receive more mapped reads! This can greatly impact our results and needs to be accounted for.
One technique to perform both of these normalizations is to use a
Variance Stabilising Transformation. For this we need both
the counts and the target lengths, and so conveniently, we can apply
this directly to the dds object we have created.
dds_vst<-vst(dds)
assay(dds_vst) %>% head()
## 7D 7V 8D 8V 10D 10V
## PECUL23A000004 8.408832 8.016173 7.795458 8.508675 9.698203 10.030478
## PECUL23A000008 4.834151 4.834151 4.834151 4.834151 4.834151 4.834151
## PECUL23A000011 9.782295 9.553379 9.594917 9.677563 10.270035 10.397357
## PECUL23A000017 7.847992 8.222655 8.381942 8.201588 8.050771 7.857424
## PECUL23A000021 8.691870 8.828108 9.012648 9.062747 8.649458 8.688747
## PECUL23A000024 5.879286 5.604807 5.993232 6.195822 7.348643 7.774410
## 11D 11V 21D 21V 22D 22V
## PECUL23A000004 8.126095 8.584872 9.859782 9.247662 9.393828 9.968497
## PECUL23A000008 4.834151 4.834151 4.834151 5.111986 4.834151 4.834151
## PECUL23A000011 10.025821 10.012554 10.065096 9.914462 10.126599 10.018931
## PECUL23A000017 7.592364 7.807522 9.382773 8.858247 8.286428 8.111973
## PECUL23A000021 8.876448 8.818017 8.692063 8.664252 8.707617 8.606508
## PECUL23A000024 5.917729 6.440581 7.098071 5.770632 6.690472 6.443489
We can now pass this transformed counts data to a canned PCA function. We are going to use our Tidyverse knowledge however, to make the plot a bit more informative.
# standard DESeq2 PCA output top 500 most variable genes
plotPCA(dds_vst, intgroup=c("position", "background"), ntop=500)
# run PCA, but output the data, not the plot
pcaData<-plotPCA(dds_vst, intgroup=c("position", "background"), ntop=500, returnData=TRUE)
# extract percentage variance per PC axis
percentVar <- round(100 * attr(pcaData, "percentVar"))
# make a nice pca plot
ggplot(pcaData, aes(PC1, PC2, color=background, shape=position)) +
geom_point(size=4) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
coord_fixed() +
scale_color_manual(values=c("black","white"))+
theme_dark() +
ggtitle("PCA on 500 most variable genes")
Question: Can we see any patterns in the PCA biplot?
The PCA can be useful to detect any outliers or spot potential problems with our data, but it is not an official statistical test. Our ultimate goal is to see if a given gene is expressed to a different degree in one condition vs. another condition. Ideally we want to do that for all genes at once, and this is the aim of a differential gene expression analysis. At it's core this sounds almost like a T-test kind of problem, but things are not that simple. In this exercise, I don't want to focus too much on the theory, but it is worthwhile to briefly discuss some of the "problematic" features of our data, that would make fitting a simple statistical model difficult. Some of this we have already seen.
The differential expression analysis in DESeq2 takes
several steps to try to deal with these problems. At its core it fits
generalized linear models with a negative binomial distribution to each
gene. It takes into account the estimated counts per gene, normalized by
library size, but also takes into account the size of the target
genes/transcripts and a dispersion parameter. This dispersion parameter
is calculated using all genes, even though the differential expression
test is later fitted to only a single gene at a time.
It then performs a pairwise comparison of the level of expression per gene in condition A vs. condition B. It uses the models per gene to estimate coefficients and standard error for each sample group (i.e. for condition A and condition B). These coefficients are the estimates of log2 fold change.
To test whether this log2 fold change for a given gene is significant, it then applies a statistical test (Wald test), to test the hypothesis that the parameter (log 2 fold change) is significantly different from 0. A multiple test correction is then applied, by default a version of the Benjamini-Hochberg False Discovery Rate (FDR). This is done by ranking the genes by p-value, then:
\(adjusted\ p\ value = p\ value * \frac{total\ number\ of\ tests}{rank}\)
In short, DESeq2 does the following:
These are all done with a single convenient function:
dds <- DESeq(dds)
dds
## class: DESeqDataSet
## dim: 21412 12
## metadata(1): version
## assays(6): counts avgTxLength ... H cooks
## rownames(21412): PECUL23A000004 PECUL23A000008 ... PECUL23A062646
## PECUL23A062647
## rowData names(26): baseMean baseVar ... deviance maxCooks
## colnames(12): 7D 7V ... 22D 22V
## colData names(4): sample_id specimen_id position background
Note: it looks like we are "overwriting" the
ddsobject, but really, we are just adding more inforamtion to it
This incredibly convenient function has done all of the heavy lifting
for us, and it has even run some pair-wise tests already, based on the
~position + background variables we set.
We can access the names of the pairwise tests like so:
resultsNames(dds)
## [1] "Intercept" "position_V_vs_D"
## [3] "background_white_vs_black"
We can extract the results for the pair-wise tests from the
dds object like so:
# dorsal versus ventral
res_pos<-results(dds, name = "position_V_vs_D")
res_pos
## log2 fold change (MLE): position V vs D
## Wald test p-value: position V vs D
## DataFrame with 21412 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## PECUL23A000004 522.8802192 0.08596494 0.442275 0.1943698 0.845886
## PECUL23A000008 0.0884324 0.52069116 3.116540 0.1670735 0.867312
## PECUL23A000011 950.0207852 -0.05010279 0.127773 -0.3921240 0.694967
## PECUL23A000017 261.7948765 -0.17473465 0.289523 -0.6035266 0.546158
## PECUL23A000021 384.9774416 0.00797549 0.119486 0.0667484 0.946782
## ... ... ... ... ... ...
## PECUL23A062639 12.689338 -0.2622702 0.487949 -0.537495 0.590925
## PECUL23A062642 0.320325 -1.4628909 3.092154 -0.473098 0.636143
## PECUL23A062644 1.174657 -0.8341345 1.190627 -0.700584 0.483563
## PECUL23A062646 7.013169 -0.3679098 0.871624 -0.422097 0.672954
## PECUL23A062647 108.200574 -0.0261807 0.248048 -0.105547 0.915942
## padj
## <numeric>
## PECUL23A000004 0.99974
## PECUL23A000008 NA
## PECUL23A000011 0.99974
## PECUL23A000017 0.99974
## PECUL23A000021 0.99974
## ... ...
## PECUL23A062639 0.99974
## PECUL23A062642 NA
## PECUL23A062644 NA
## PECUL23A062646 0.99974
## PECUL23A062647 0.99974
# black vs. white
res_bg<-results(dds, name = "background_white_vs_black")
res_bg
## log2 fold change (MLE): background white vs black
## Wald test p-value: background white vs black
## DataFrame with 21412 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## PECUL23A000004 522.8802192 -0.685910 0.442276 -1.550866 0.1209338
## PECUL23A000008 0.0884324 0.487485 3.116540 0.156419 0.8757031
## PECUL23A000011 950.0207852 -0.396665 0.127775 -3.104393 0.0019067
## PECUL23A000017 261.7948765 0.726795 0.289528 2.510272 0.0120638
## PECUL23A000021 384.9774416 0.122220 0.119487 1.022874 0.3063675
## ... ... ... ... ... ...
## PECUL23A062639 12.689338 -0.196683 0.487876 -0.403141 0.68684432
## PECUL23A062642 0.320325 1.510412 3.092154 0.488466 0.62521986
## PECUL23A062644 1.174657 -0.820934 1.189680 -0.690046 0.49016535
## PECUL23A062646 7.013169 -0.475565 0.871688 -0.545568 0.58536277
## PECUL23A062647 108.200574 0.651857 0.248119 2.627195 0.00860921
## padj
## <numeric>
## PECUL23A000004 0.3095127
## PECUL23A000008 NA
## PECUL23A000011 0.0184354
## PECUL23A000017 0.0687015
## PECUL23A000021 0.5423588
## ... ...
## PECUL23A062639 0.8399464
## PECUL23A062642 NA
## PECUL23A062644 NA
## PECUL23A062646 0.7719239
## PECUL23A062647 0.0537317
These tables have the following information:
Question: Is there anything unexpected about the adjusted p-values?
Some of the adjusted p-values are NA. This is what the
DESeq2 handbook tells us:
"If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA. Description and customization of independent filtering is described below".
Let's check some of those counts with NA padj:
counts(dds)[rownames(res_pos)[is.na(res_pos$padj)],] %>% head(15)
## 7D 7V 8D 8V 10D 10V 11D 11V 21D 21V 22D 22V
## PECUL23A000008 0 0 0 0 0 0 0 0 0 1 0 0
## PECUL23A000029 0 0 1 0 0 0 0 0 0 0 0 0
## PECUL23A000056 0 0 0 0 0 0 0 0 2 0 2 2
## PECUL23A000063 0 0 0 0 0 0 0 0 0 0 1 1
## PECUL23A000099 1 0 0 1 0 0 0 3 1 0 1 2
## PECUL23A000126 1 0 0 0 0 0 2 0 1 1 1 0
## PECUL23A000130 1 0 0 0 0 0 0 0 0 0 0 0
## PECUL23A000195 0 0 0 0 0 0 2 0 0 0 0 1
## PECUL23A000197 1 2 0 0 1 2 0 2 6 2 2 0
## PECUL23A000266 0 2 3 0 1 0 3 0 0 0 0 0
## PECUL23A000277 0 0 0 0 1 0 0 0 0 0 0 0
## PECUL23A000289 7 11 6 14 12 268 3 12 8 10 5 14
## PECUL23A000302 0 0 0 0 0 1 0 0 1 1 0 1
## PECUL23A000303 1 2 0 0 0 0 0 0 0 0 0 0
## PECUL23A000313 0 0 0 1 0 0 0 0 0 0 0 0
Indeed! all have very low counts!! they are therefore unreliable, and should rightfully not be included in the results.
We can now summarize these results tables:
summary(res_pos, alpha=0.05)
##
## out of 21363 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 324, 1.5%
## LFC < 0 (down) : 342, 1.6%
## outliers [1] : 160, 0.75%
## low counts [2] : 4548, 21%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
summary(res_bg, alpha=0.05)
##
## out of 21363 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 1537, 7.2%
## LFC < 0 (down) : 936, 4.4%
## outliers [1] : 160, 0.75%
## low counts [2] : 5366, 25%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
This returns the number of up-regulated genes (LFC>0) and down-regulated genes (LFC<0) with respect to the control variable. Other information is also given, such as the number of outliers and low counts.
Question: What do you notice about these results?
A plot resulting from a differential gene expression analysis that you may also have seen is a "Volcano" plot. These show the log fold change plotted against the adjusted p-values per gene. To make small p-values (padj) very large, the p-values are usually -log10() transformed. The most basic plot would look like this:
res_pos %>%
as_tibble(rownames = "gene_id") %>%
ggplot(aes(x=log2FoldChange, y=-log10(padj))) +
geom_point(alpha=0.75, shape=16) +
ggtitle("Dorsum vs. Ventrum")
## Warning: Removed 4757 rows containing missing values or values outside the scale range
## (`geom_point()`).
In such a plot, each dot represents a gene and the dots with positive fold changes (x-axis) are genes that are over expressed in the treatment versus the control and the genes with negative fold changes are those that are underexpressed. As in this case, we don't always have a clear control versus treatment set up, but DESeq2 will always take second contrast as the control. In our case this was the dorsal skin. This means that all genes on the left of 0 are over expressed in the dorsal skin and all genes to the right of 0 are over expressed in the ventral skin.
The degree of significance (adjusted p-value) is shown on the y-axis. This higher up genes are, the more significant their differential expression.
We can tweak this a little bit to highlight just those genes that are highly differentially expressed. Let's put a p-value threshold at <0.05 and a log fold change threshold at >2.
res_pos %>%
as_tibble(rownames = "gene_id") %>%
drop_na(padj) %>% # drop all genes with NAs
filter(padj<0.5) %>% # reduce the number of points that need to be plotted
mutate(significant= padj<0.05 & abs(log2FoldChange)>=2) %>% # make a variable to indicate if a gene is significant based on a specific thresholds
ggplot(aes(x=log2FoldChange, y=-log10(padj), color=significant)) +
geom_point(alpha=0.75, shape=16) +
ggtitle("Dorsum vs. Ventrum") +
theme_bw()
Of course, now we would really like to know what those genes are! So far, we have been working with gene IDs that refer to identified transcripts/genes in the Pelobates cultripes genome assembly, but they have little real meaning to us. How to obtain annotations is a topic in itself and ideally would come from functional studies. However, such studies tend to only be very extensive for model systems and so we often rely in homology. I.e. we will assume that gene sequences with similar amino acid configurations will have similar functions. We can therefore use sequence similarity tools, like BLAST to map genes and their annotations from model organisms (in our case Xenopus tropicalis) onto our genome. Lets import these annotations.
annotations<-read_csv("../data/annotations.csv")
annotations
## # A tibble: 24,392 × 4
## gene_id xen_pep_id xen_gene_symbol xen_description
## <chr> <chr> <chr> <chr>
## 1 PECUL23A000004 ENSXETP00000113033 kiaa1522 KIAA1522 [Sour…
## 2 PECUL23A000005 ENSXETP00000118270 <NA> <NA>
## 3 PECUL23A000006 ENSXETP00000117992 <NA> <NA>
## 4 PECUL23A000008 ENSXETP00000100780 pltp phospholipid t…
## 5 PECUL23A000011 ENSXETP00000042154;ENSXETP000… fxr1 FMR1 autosomal…
## 6 PECUL23A000012 ENSXETP00000115735 clasrp CLK4-associati…
## 7 PECUL23A000017 ENSXETP00000041999 lyg2 lysozyme G-lik…
## 8 PECUL23A000018 ENSXETP00000105337 <NA> <NA>
## 9 PECUL23A000019 ENSXETP00000079593 krt34 keratin 34 [So…
## 10 PECUL23A000021 ENSXETP00000033990;ENSXETP000… pdzd2;znf207 PDZ domain con…
## # ℹ 24,382 more rows
Question: Do all of our genes have homologous genes in X. tropicals? And are they all annotatated? What does this mean?
annotations %>%
filter(is.na(xen_pep_id)) %>%
nrow()
## [1] 251
annotations %>%
filter(is.na(xen_gene_symbol)) %>%
nrow()
## [1] 5623
Lets add these annotations to our plot. We could do this by adding text label to only our significant genes
res_pos %>%
as_tibble(rownames = "gene_id") %>%
drop_na(padj) %>% # drop all genes with NAs
filter(padj<0.5) %>% # reduce the number of points that need to be plotted
mutate(significant= padj<0.05 & abs(log2FoldChange)>=2) %>% # make a variable to indicate if a gene is significant based on a specific thresholds
left_join(annotations) %>% # pull across the extra columns, based on a "gene_id" match
ggplot(aes(x=log2FoldChange, y=-log10(padj), color=significant, label=xen_gene_symbol)) +# add label text
geom_point(alpha=0.75, shape=16) +
geom_text(data= . %>% filter(significant)) + # add text labels for only the significant genes
ggtitle("Dorsum vs. Ventrum") +
theme_bw() +
theme(legend.position="none")
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_text()`).
This is a little messy. We could try and clean it up by having the text labels repel each other:
res_pos %>%
as_tibble(rownames = "gene_id") %>%
drop_na(padj) %>% # drop all genes with NAs
filter(padj<0.5) %>% # reduce the number of points that need to be plotted
mutate(significant= padj<0.05 & abs(log2FoldChange)>=2) %>% # make a variable to indicate if a gene is significant based on a specific thresholds
left_join(annotations) %>% # pull across the extra columns, based on a "gene_id" match
ggplot(aes(x=log2FoldChange, y=-log10(padj), color=significant, label=xen_gene_symbol)) +
geom_point(alpha=0.75, shape=16) +
geom_text_repel(data= . %>% filter(significant)) +
ggtitle("Dorsum vs. Ventrum") +
theme_bw() +
theme(legend.position="none")
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
## Warning: ggrepel: 73 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
This is already much better, but geom_text_repel() has
decided to drop labels for points that are too close to each other. An
interesting alternative is therefore to make this plot interactive, so
that labels are only shown when we hover over points. This is easier to
do than might think with the javascript package plotly.
# save the plot as a ggplot object
p1<-res_pos %>%
as_tibble(rownames = "gene_id") %>%
drop_na(padj) %>% # drop all genes with NAs
filter(padj<0.5) %>% # reduce the number of points that need to be plotted
mutate(significant= padj<0.05 & abs(log2FoldChange)>=2) %>% # make a variable to indicate if a gene is significant based on a specific thresholds
left_join(annotations) %>% # pull across the extra columns, based on a "gene_id" match
ggplot(aes(x=log2FoldChange, y=-log10(padj), color=significant, label=xen_gene_symbol)) +
geom_point(alpha=0.75, shape=16) +
ggtitle("Dorsum vs. Ventrum") +
theme_bw() +
theme(legend.position="none")
# convert the ggplot to a plotly object
ggplotly(p1, tooltip = "label") # define the "label" aesthetic as the hover text
To get back to our story at hand though, lets make one final plot, where we show volcanoes for both comparisons (dorsal versus ventral and black versus white), and lets add only those annotations that are related to the photosensitive melanin synthesis pathways.
# load a reduced annotation file
melanin_genes<-read_csv("../data/melanin_genes.csv")
melanin_genes
## # A tibble: 14 × 3
## gene_id gene_symbol gene_name
## <chr> <chr> <chr>
## 1 PECUL23A055459 mc1r melanocortin-1 receptor
## 2 PECUL23A019284 asip agouti-signaling
## 3 PECUL23A040111 mitf Melanocyte Inducing Transcription Factor
## 4 PECUL23A015118 sox10 SRY-box 10
## 5 PECUL23A023288 pax3 paired box 3
## 6 PECUL23A000301 tyr tyrosinase
## 7 PECUL23A011713 dct dopachrome tautomerase
## 8 PECUL23A038095 tyrp1 tyrosinase-related protein 1
## 9 PECUL23A009622 pmel premelanosome protein
## 10 PECUL23A034305 oca2 oculocutaneous albinism II
## 11 PECUL23A039712 slc24a5 solute carrier family 24 member 5
## 12 PECUL23A053655 slc45a2 solute carrier family 45 member 2
## 13 PECUL23A047519 mlph melanophilin
## 14 PECUL23A043421 mlana melan-A
# combine the two DESeq results tables
res<-bind_rows(res_pos %>%
as_tibble(rownames="gene_id") %>%
mutate(contrast="Dorsal vs. Ventral"),
res_bg %>%
as_tibble(rownames="gene_id") %>%
mutate(contrast="Black vs. White"))
# plot
res %>%
drop_na(padj) %>%
# filter(padj<0.5) %>%
mutate(sig= padj<0.05) %>% # lets be a bit more relaxed and not put a threshold on the fold change
left_join(melanin_genes) %>%
ggplot(aes(x=log2FoldChange, y=-log10(padj), color=sig, label=gene_symbol)) +
geom_point(alpha=0.75, shape=16) +
geom_text_repel(max.overlaps = Inf) + # set the maximum overlap to infinity, to show all labels.
xlim(-15,15) + # fix axis width
facet_wrap(~contrast) + # facet based on contrast
theme_minimal() +
scale_color_manual(values=c("grey","black")) +
theme(legend.position = "none")
## Warning: Removed 32464 rows containing missing values or values outside the scale range
## (`geom_text_repel()`).
Lets look at our gene pathway schematics again and discuss the results.
Question: What genes from these pathways did we recover and are they expressed in a way that we expect?
As a last bonus, to match our phenotype plots, we can also show our expression data as reaction norms. For this, we can use the normalized expression values from the VST matrix.
assay(dds_vst) %>%
as_tibble(rownames="gene_id") %>%
left_join(melanin_genes) %>% # add annotations
drop_na(gene_symbol) %>% # keep only genes with annotations
select(gene_id, gene_symbol, where(is.numeric)) %>%
pivot_longer(-c(gene_id, gene_symbol), names_to="sample_id", values_to = "expression") %>%
left_join(rna_samples) %>% # pull in experimental info about samples
# instead of calculating means and standard deviations, we will use the stat_summary() layer from ggplot to calculate these on the fly.
ggplot(aes(x=position, y=expression, color=background)) +
stat_summary(aes(group = background), fun = mean, geom = "path") +
stat_summary(aes(color = background), fun.data = mean_cl_boot, geom = "errorbar", width = 0.1) +
stat_summary(aes(color = background), fun= mean, geom = "point", size = 2) +
scale_color_manual(values = c("black"="black","white"="grey80")) +
facet_wrap(~gene_symbol) +
theme_bw() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) # remove grid lines